Project creative extension

This notebook contains the P4 milestone of the project, a creative extension of the paper Comparing Random Forest with Logistic Regression for Predicting Class-Imbalanced Civil War Onset Data by David Muchlinski, David Siroky, Jingrui He and Matthew Kocher.

The research questions we are trying to answer with this notebook are:

Here is the abstract of the content of this notebook, as well as the datasets and methods which will be used.

Abstract

The takeaways from the original paper are that algorithmic approaches offer higher predictive power for civil war onset than traditional techniques, an asset used to corroborate some previous causal conclusions and to question others. We propose to further exploit the derived algorithmic model in order to analyze some temporal aspects of these conflicts. Namely, we will study the evolution of the best predictors across the years. We will also tackle the arguably as important issue of the ending of these conflicts, to gain insight into the enablers of peace. To do so we will manipulate the data and leverage the same algorithmic approach to predict endings. Finally, we will use another model, namely artificial neural networks, to study whether predictive accuracy can be enhanced further. In the same spirit as the paper, permutation importance will then be used to strengthen existing hypotheses or formulate new ones in our causal analysis.

Dataset:

Methods:

Table of contents:

The material found in this notebook:

  1. Setup
  2. Utilities
  3. Dataset wrangling
  4. Dataset exploration
  5. Prediction of civil war onset
  6. Analysis of change of civil war onset over the years
  7. Prediction of civil war end

Setup

Utilities

These are functions used throughout the notebook for our analysis. The most important one is the pipeline used to compare the Random Forest and the Multilayer Perceptron approaches.
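A minimal sketch of such a comparison pipeline, assuming scikit-learn estimators and hypothetical hyperparameters (the actual notebook configuration may differ):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score

def compare_models(X, y, seed=0):
    """Fit RF and MLP on the same stratified split and return their test AUCs."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    models = {
        # class_weight="balanced" mitigates the class imbalance for the RF
        "RF": RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                     random_state=seed),
        # the MLP needs scaled inputs, hence the pipeline
        "MLP": make_pipeline(StandardScaler(),
                             MLPClassifier(hidden_layer_sizes=(64, 32),
                                           max_iter=500, random_state=seed)),
    }
    return {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
            for name, m in models.items()}
```

The function returns a dict such as `{"RF": ..., "MLP": ...}` so both models can be compared on identical held-out data.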

Data wrangling

Dataset extension

For plotting the world maps, we extend the dataset with the alpha3 country codes, and the country names.
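The extension step amounts to mapping country names to their ISO alpha-3 codes and adding the result as a column. A minimal sketch, where the `ALPHA3` dict is a tiny illustrative excerpt (in the notebook the full mapping comes from an external table):

```python
import pandas as pd

# Hypothetical excerpt of the country-name -> ISO alpha-3 mapping.
ALPHA3 = {"Colombia": "COL", "Iraq": "IRQ", "Nicaragua": "NIC", "India": "IND"}

def add_country_codes(df, name_col="country"):
    """Return a copy of df with an alpha3 column derived from the country name."""
    out = df.copy()
    out["alpha3"] = out[name_col].map(ALPHA3)  # unmapped names become NaN
    return out

df = pd.DataFrame({"country": ["Colombia", "Iraq"]})
print(add_country_codes(df))
```

Unmatched names come out as NaN, which makes it easy to spot countries that still need a manual mapping before plotting.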

Civil war onset dataset definition

For our entire analysis, we will use the features selected in the Amelia dataset. These features were chosen to enable analysis of the causal aspects of civil war, and since we are interested in the causes of both civil war onset and civil war end, we keep the same selection. However, because the Amelia dataset has missing values and is noisier than the Sambanis dataset, we will use the latter, restricted to the features of the Amelia dataset.

For the prediction of civil war onset, we will use the variable warstds, which we add to the dataset.

Verification of data validity

In this first part, we will have a look at the dataset and verify that the data seems valid. We will also inspect the features and discard some uninteresting ones.

Firstly, we will quickly check for faulty data.

There does not seem to be any missing data. Let's check whether the -1 values are normal.

It seems that for all these variables, values can be negative or positive and are spread over the whole range of possible values, so the presence of -1 values appears normal.

Next we will check for out-of-range values.

There do not seem to be any out-of-range or abnormally large values. However, some variables in the dataset could be redundant; we will try to identify those and keep only one of each. Also, some variables may carry no valuable information for this analysis.

For the variables that also appear squared, we will keep only the squared versions; this affects the sxpnew and pol4 variables. We will drop numlang, the number of languages, since linguistic fractionalization is already captured by the dlang variable. Between durable and proxregc, only the latter will be kept, since it is a preprocessed feature.

Civil war end dataset definition

Next we will check whether this dataset provides a variable for civil war duration, and prepare the dataset for civil war end prediction.

atwards is the variable indicating whether a country was at war in a given year. With some data wrangling, atwards could serve as an indicator of civil war duration, but it is not very precise (yearly granularity only).

The duration of civil wars in years is rather small, so the time granularity might not be sufficient to produce good estimates of duration. Instead, we will predict, for a given year, whether the war has stopped. To that end, we add a new variable, warend, which is 1 if the war ended that year (atwards switched from 1 to 0) and 0 if the war is still ongoing (atwards remains 1). Note that here we do not drop the variables that were selected for the onset prediction, because the prediction of civil war ending is significantly better this way. The prediction of civil war ending is in fact the subject of a discussion on the order of causality at the end of this project, so it does not really matter whether the causal factors are chosen well in advance.
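The construction of warend can be sketched with pandas, assuming one row per (country, year) pair and atwards in {0, 1} (column names are illustrative):

```python
import pandas as pd

def add_warend(df):
    """Label each war year with whether the war ends that year."""
    df = df.sort_values(["country", "year"]).copy()
    nxt = df.groupby("country")["atwards"].shift(-1)
    # The war ends this year when the country is at war now but not next year.
    df["warend"] = ((df["atwards"] == 1) & (nxt == 0)).astype(int)
    # Keep only war years: only then is "did the war end?" a meaningful question.
    return df[df["atwards"] == 1]

toy = pd.DataFrame({"country": ["A"] * 5,
                    "year": [1990, 1991, 1992, 1993, 1994],
                    "atwards": [0, 1, 1, 0, 0]})
print(add_warend(toy)[["year", "warend"]])  # 1991 -> 0, 1992 -> 1
```

Grouping by country before shifting is important: without it, the last war year of one country would be compared against the first row of the next country.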

Dataset exploration

In order to get a better sense of the data at hand and of how civil war onsets are spread across the world, we perform some data exploration here. We will also use world maps later in the project to study whether the predictors found indeed show up on the maps. The idea here is to be convinced that civil wars are still frequent in modern times, and that they are an issue that must be dealt with.

Map of number of civil war onsets

Here is the map of the number of civil war onsets a country has seen since 1945.

The world regions most struck by civil wars are Latin America, Eastern Africa and the Middle East. Some countries, such as Nicaragua, India and Iraq, have seen up to 3 civil wars since 1945. Given such results, geographical, economic and temporal patterns likely influence whether a country goes into civil war.

Map of number of years at war

Here is the map of the number of years a country has been in civil war (can span different civil wars).

Some countries, such as Colombia, have been in civil war for decades, and those are not necessarily the same ones that have seen the most onsets. Seeing how long civil wars can last, and knowing how destructive they are, the importance of understanding their causal factors is evident.

Prediction of civil war onset

In this part, we will first predict civil war onset using the two methods of interest, Random Forest and Neural Networks. We will then study the predictive power of the two methods by plotting their ROC curves along with the AUC score, a method also used by the authors. The ROC curve is useful in the context of class imbalance since we do not know the threshold for good classification a priori, and it gives a good idea of the robustness of the prediction. We will also perform dimensionality reduction on the activations of the last hidden layer of the MLP in order to get better insight into how this type of model learns non-linear mappings between features and prediction. This kind of approach would also be useful when using inputs of different modalities (visual, temporal). We will then look at the feature importances using permutation importance on a test set. Finally, we will plot some of these important variables on a world map for both values of the dependent variable.
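The ROC/AUC comparison can be sketched as follows, with toy score arrays standing in for the predicted probabilities of the two models (the actual curves in the notebook come from the fitted RF and MLP):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_points(y_true, scores):
    """FPR/TPR pairs and the AUC for one model's predicted probabilities."""
    fpr, tpr, _ = roc_curve(y_true, scores)
    return fpr, tpr, auc(fpr, tpr)

# Toy labels and scores; in the notebook these are y_test and predict_proba outputs.
y = np.array([0, 0, 0, 1, 0, 1])
rf_scores = np.array([0.1, 0.2, 0.3, 0.9, 0.4, 0.8])
fpr, tpr, rf_auc = roc_points(y, rf_scores)
print(rf_auc)  # perfect ranking on this toy example -> 1.0
```

Because the ROC curve sweeps over all thresholds, it sidesteps the question of where to place the decision threshold on a heavily imbalanced problem.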

Train models for civil war onset

Note that the output of this cell is removed (by running a blank cell and pasting the code) in order to eliminate verbose output that makes the notebook very long.

Predictive power comparison

The ROC curves yield good results for both approaches, with a slight advantage for RF. In both cases, the approaches outperform the traditional statistical approaches presented in the original paper. As we will see in the rest of the project, MLPs are less powerful than RFs at handling imbalanced data.

MLP activations analysis

Here are the activations of the last hidden layer of the MLP for the dataset. We also use two controls - the raw features and untrained activations - in order to put the results into perspective, since with ANNs even untrained networks can show well-clustered activations due to the input distribution. Here we see that the trained MLP indeed groups the civil war onset datapoints (top-right) better than the controls. Still, the cluster overlaps with some negative datapoints. This shows that the MLP is able to extract a mapping from the features that separates the two events (civil war onset and no onset) reasonably well to make a prediction.

Feature importance comparison

This plot shows the importance of each feature using permutation importance. Permutation importance on a test set informs on the features that matter for generalization. Two things stand out: some features are very important for both models, and some features have contradicting importances even though the two models have similar predictive power. The latter point highlights that talking about causality when using machine learning techniques is risky, as different models can leverage features differently for prediction. However, some features are clearly very important for both models, such as primary commodity exports/GDP squared sxpsq, trade as a percentage of GDP trade, autonomous regions autonomy, rough terrain lmtnest, percentage of illiteracy illiteracy, and military power milper. This analysis corroborates some of the results of the original paper with the new model. However, some features such as GDP growth gdpgrowth show contradicting predictive power on the test set, which goes against what the original paper shows. Also, whether the state is new nwstate shows high predictive power in both models, a finding not present in the original paper.
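A minimal sketch of permutation importance on a held-out set, using synthetic data where only the first feature carries signal (the notebook applies the same call to the fitted RF and MLP):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 determines the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
# Shuffling a column and measuring the score drop on *test* data tells us
# which features matter for generalization, not just for fitting.
result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 dominates
```

Negative mean importances can occur by chance when a feature is useless, which is why slightly negative values in the plots should be read as "no importance".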

Map of primary commodity exports/GDP squared sxpsq

Here we plot one of the most important features for both values of warstds in order to see if countries indeed show different values.

We clearly observe that countries experiencing a civil war onset indeed show a noteworthy decrease in primary commodity exports/GDP. Peru, for instance, has seen, on average, a drop from 0.05 to 0.006, nearly an order of magnitude.

Map of illiteracy

Here we plot one of the most important features for both values of warstds in order to see if countries indeed show different values.

Similarly, we see that states experiencing a civil war onset have higher rates of illiteracy. One staggering example is that of Afghanistan, where illiteracy increased from 34% to 83% at the time of the onset.

Analysis of change of civil war onset over the years

Important features change along the years

As we saw before in the dataset exploration graphs, the features of civil war onsets change over the years. In this part we will investigate whether the causal aspects of civil war onset differ across years by analyzing the feature importances. We will train Random Forest classifiers and Neural Networks as civil war onset predictors on bins of 15 years of data, spaced 5 years apart.

Firstly, we separate the data into bins.
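The overlapping windows can be sketched as follows, assuming a year column and illustrative start/end years:

```python
import pandas as pd

def year_bins(df, start=1945, end=2000, width=15, step=5):
    """Yield (window_start, sub-dataframe) for overlapping year windows."""
    for lo in range(start, end - width + 1, step):
        yield lo, df[(df["year"] >= lo) & (df["year"] < lo + width)]

toy = pd.DataFrame({"year": range(1945, 2000)})
bins = dict(year_bins(toy))
print(sorted(bins))     # window start years: 1945, 1950, ..., 1985
print(len(bins[1945]))  # 15 rows in the first window
```

Each model is then trained on one window's rows; the overlap smooths the feature-importance trajectories across consecutive windows.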

Next we will train the predictors and get the feature importances. Note that the output of this cell is removed (by running a blank cell and pasting the code) in order to eliminate verbose output that makes the notebook very long.

After training, we want to see the performance of each predictor. Since the data is sparse within the bins, the predictors are somewhat worse than the main predictor trained on all the data, especially for some periods.

And now we can finally plot the variation of feature importances over the years. In the first plot, the feature importance is shown on a linear scale, including negative values. This gives an idea of how reliable the feature importances are for each bin: poorly performing models tend to have less reliable feature importances, and therefore many negative ones. The second plot is on a logarithmic scale and is used to compare all the feature importances. We are not interested in the negative ones, since they only indicate that the model is not very good.

Clustering of the civil war onsets

In the previous analysis, by taking slices of 15 years, we imposed a period over which the feature importances could be similar. Now the idea is to cluster the events using the features, in order to see whether this gives rise to natural periods where the features are similarly important.

We will select the important features found in the previous analysis, in order to permit temporally meaningful clustering; see below the selected top 20 features. Since the Random Forest models are generally better, we take all features with a permutation importance higher than 0.005 for the Random Forest model, and complete the set to 20 with the top features of the Multilayer Perceptron models that were not already included.

Here we want to see which features are binary. For the clustering, the continuous features will be normalized while the binary features will be kept as they are. To that end we inspect the following variables, suspected to be binary.

In this part, we define the features to keep, the binary features and we scale the continuous features to prepare the dataset for clustering.

In order to reduce the dimensionality and obtain better clustering, we will apply Principal Component Analysis, keeping 90% of the explained variance. This reduces the number of components to about a dozen.
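With scikit-learn, passing a float in (0, 1) as n_components keeps the smallest number of components reaching that fraction of explained variance. A sketch on random data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))  # stand-in for the scaled feature matrix

# n_components=0.90 keeps the fewest components whose cumulative
# explained variance reaches 90%; this needs the full SVD solver.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X)
print(X_reduced.shape[1], pca.explained_variance_ratio_.sum())
```

On the actual dataset the correlated features compress much more than this isotropic toy example does.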

We will first use the OPTICS algorithm, a variation of DBSCAN that chooses the epsilon parameter automatically. We use it because of its small number of hyperparameters and its ability to find non-globular clusters. An appropriate distance metric must be chosen; the correlation metric, which computes the correlation between vectors, seems to work well. Then minPts must be chosen. For this we will plot the silhouette score and the number of outliers the algorithm finds, since we do not want too many outliers in the clustering.
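The sweep over minPts can be sketched as follows; this toy example uses the euclidean metric on two synthetic blobs, while the notebook uses metric="correlation" on the PCA-reduced features:

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(40, 3)),
               rng.normal(3, 0.3, size=(40, 3))])

for min_samples in (5, 10, 19):
    labels = OPTICS(min_samples=min_samples, metric="euclidean").fit_predict(X)
    n_outliers = int((labels == -1).sum())  # OPTICS marks outliers with -1
    clustered = labels != -1
    # Silhouette is undefined with fewer than 2 clusters on the kept points.
    if len(set(labels[clustered])) >= 2:
        score = silhouette_score(X[clustered], labels[clustered])
    else:
        score = float("nan")
    print(min_samples, n_outliers, score)
```

Tracking both the silhouette score and the outlier count, as the notebook does, guards against a minPts value that scores well only because it discards most of the data.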

The silhouette score is not defined when the clustering results in a single cluster, which is why the line is missing in some places. The clustering is not very good and yields many outliers. However, a minimum number of samples of 19 gives good results, producing 2 clusters. Next we will explore clustering with KMeans.

Here the clustering is a bit better: the best number of clusters seems to be 11. However, the silhouette scores are still not very high, which means the data does not separate easily into clusters.
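The KMeans model selection follows the same pattern: sweep the number of clusters and pick the silhouette maximum. A sketch on two well-separated synthetic blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(50, 2)),
               rng.normal(4, 0.3, size=(50, 2))])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # the two well-separated blobs give best_k == 2
```

On the real data the silhouette curve is much flatter, which is why the resulting k = 11 clusters should be treated with caution.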

Next we will visualize our analyzed clusters using t-SNE.

Now that we have clusters, we want to see how the years of the events in these clusters are distributed. For this we will first compute the mean of each cluster and the 90% interval of the year values in each cluster.
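One way to compute that per-cluster summary is the empirical 5th–95th percentile interval, sketched here on illustrative year values:

```python
import numpy as np

def year_summary(years):
    """Mean and empirical 90% interval (5th-95th percentile) of onset years."""
    years = np.asarray(years)
    lo, hi = np.percentile(years, [5, 95])
    return years.mean(), (lo, hi)

# Hypothetical onset years of one cluster.
mean, (lo, hi) = year_summary([1950, 1955, 1960, 1965, 1990])
print(mean, lo, hi)  # 1964.0 1951.0 1985.0
```

Applied per cluster label, this gives one mean and one interval per cluster, which is what the plot below shows.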

Here is presented the time of the civil war onsets, grouped into the clusters found previously.

We see on this plot that the clusters actually span quite long periods, though some clusters are quite narrow in time. The narrow ones would indeed suggest that we found civil war onsets with similar causes tied to their period of onset. However, the clusters are not temporally separated. This could be due to various things: the similar causes we found could spread over a wide range of years, or have no relation to temporality at all; it could also be that the data is not easily clusterable and the clusters found are not very reliable.

Next are the violin plots, made with Plotly.

Prediction of civil war end

In this part, we will first predict civil war ending using the two methods of interest, Random Forest and Neural Networks. The idea is to predict, during a war, whether the war will end this year. We will then study the predictive power of the two methods by plotting their ROC curves along with the AUC score, a method also used by the authors. The ROC curve is useful in the context of class imbalance since we do not know the threshold for good classification a priori, and it gives a good idea of the robustness of the prediction. We will also perform dimensionality reduction on the activations of the last hidden layer of the MLP in order to get better insight into how this type of model learns non-linear mappings between features and prediction. This kind of approach would also be useful when using inputs of different modalities (visual, temporal). We will then look at the feature importances using permutation importance on a test set. Finally, we will plot some of these important variables on a world map for both values of the dependent variable.

Train models for civil war end

Note that the output of this cell is removed (by running a blank cell and pasting the code) in order to eliminate verbose output that makes the notebook very long.

Predictive power comparison

The ROC curves yield good results for the RF approach and rather poor results for the MLP (yet better than random). This is probably due to the difficulty MLPs have handling class imbalance, even when class weights are explicitly added to the loss function, as we did; Random Forests, on the other hand, are very good at it. In the context of this project, we also experimented with autoencoders, which are a good alternative for dealing with class imbalance: here the "anomaly" (the classical use of autoencoders) was the end of a civil war. The ROC curve obtained for different values of the reconstruction error threshold was actually better than for the MLP. However, we chose not to present those results here for one reason: interpreting feature importance, which is the goal of this project, is not straightforward, since the model is trained only on negative examples, and relating it to causality would have needed much more work (one idea was to use the features yielding the most reconstruction error for the positive cases, but the features found were very different from the RF model's). Still, this could be a nice extension to this project.

MLP activation analysis

Here are the activations of the last hidden layer of the MLP for the dataset. We also use two controls - the raw features and untrained activations - in order to put the results into perspective, since with ANNs even untrained networks can show well-clustered activations due to the input distribution. Here we see that the trained MLP groups the civil war end datapoints (bottom-right) better than the controls. Still, the cluster overlaps with some negative datapoints. This shows that the MLP is able to extract a mapping from the features that separates the two events (civil war end and no end) reasonably well to make a prediction.

Feature importance comparison

Here we see that even though the MLP has less predictive power, the feature importances are very similar. As with civil war onset, economic factors are again very important. This point needs to be put into perspective: because of how the dataset is built (in yearly bins), it is difficult to know the order of causality here. Is the ending of the war due to large primary commodity exports/GDP, boosting the economy and assuaging minds, leading to the end of the conflict? Or do the good numbers follow the end of the war, since the economy can then thrive again? We cannot know this from this dataset, but looking at smaller time bins (months) would help us understand the ending of conflicts better.

Map of sxpsq

It is clear here that primary commodity exports/GDP rise in the years where a civil war stops. For instance, Mali shows an increase on average from 0.01 to 0.05, a fivefold rise. Is this due to the end of the civil war, or did it trigger the end of the civil war? With this dataset, that remains an open question.